Table of Contents:

Cleaning List

What are the expectations from a skills perspective?

Todo

Sandra: Data literacy literature. The university is trying to build a scale: if you can do this, you are at level 1; if you can do that, level 2; etc.

Software Development Lifecycle and Research Data Science Life Cycle are different.

MS: Two different types:

Which skills are people looking for that differ between an analyst and a scientist?

EDA

From the word lengths alone, one might infer that data science postings have more requirements.

I can do a Bayesian analysis to obtain the posterior distribution and find the probability that the mean sequence length for data science jobs is greater than the mean sequence length for data analysis jobs.
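A minimal sketch of that comparison, using only the Python standard library. It assumes a flat prior on each group mean and a normal approximation, so each posterior mean is treated as roughly Normal(sample mean, s^2 / n). The function name prob_mean_greater and the length values are hypothetical and for illustration only; they are not the project's data or its actual model.

```python
import random
import statistics


def prob_mean_greater(a, b, draws=20000, seed=0):
    """Approximate P(mean_a > mean_b | data) by Monte Carlo.

    Assumption: flat prior on each mean plus a normal approximation,
    so each posterior mean is about Normal(sample mean, s^2 / n).
    """
    rng = random.Random(seed)

    def posterior_params(x):
        # Posterior centre and spread for the group mean.
        return statistics.mean(x), statistics.stdev(x) / len(x) ** 0.5

    mu_a, se_a = posterior_params(a)
    mu_b, se_b = posterior_params(b)
    # Draw one value from each posterior and count how often a beats b.
    hits = sum(rng.gauss(mu_a, se_a) > rng.gauss(mu_b, se_b) for _ in range(draws))
    return hits / draws


# Hypothetical sequence lengths per posting (illustration only).
ds_lengths = [520, 610, 580, 700, 640, 590]
da_lengths = [410, 450, 500, 430, 470, 390]

p = prob_mean_greater(ds_lengths, da_lengths)
print(f"P(mean DS length > mean DA length) is about {p:.3f}")
```

With real data, a skewed length distribution might call for a log transform or a full model in a probabilistic programming library instead of this approximation.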

Headings Data

Set Analysis

Ideas for analysis:

Spit out a CSV file with unique header examples. Then we can make dummy variables to separate what is in the skill requirements from what the job actually does. As we start looking at different job titles, it would be interesting to see which words appear in common across them. What is in a data scientist job description that is not in a data analyst job description?

Which words appear in one but not the other? What is the range of job descriptions that we are pulling out?
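The questions above map directly onto set operations over each title's vocabulary. The sketch below is one way to do it; the vocabulary helper and the toy postings are hypothetical stand-ins for the scraped job descriptions.

```python
import re


def vocabulary(postings):
    """Return the set of lowercase word tokens seen across all postings."""
    words = set()
    for text in postings:
        words.update(re.findall(r"[a-z]+", text.lower()))
    return words


# Hypothetical toy postings (illustration only).
scientist_posts = [
    "Build machine learning models in Python",
    "PhD preferred; Spark and statistics",
]
analyst_posts = [
    "Build dashboards and reports in Excel",
    "Reporting with SQL and Python",
]

ds_vocab = vocabulary(scientist_posts)
da_vocab = vocabulary(analyst_posts)

shared = ds_vocab & da_vocab   # words common to both titles
only_ds = ds_vocab - da_vocab  # words only in data scientist postings
only_da = da_vocab - ds_vocab  # words only in data analyst postings

print("shared:", sorted(shared))
print("scientist only:", sorted(only_ds))
print("analyst only:", sorted(only_da))
```

Membership in these sets could also serve as the dummy variables mentioned above, one 0/1 column per word.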

Data Analyst postings have fewer requirements than Data Scientist postings. Is that actually true? We can look at it by industry and location as well.

Silicon Valley vs New York: are there regional differences in expectations for technical capabilities?

Random Forest Feature Importances

Let's try a cool exercise: predict whether a job posting is a data science posting or a data analyst posting.

First run:

Index  Feature          Feature Importance
281    analyst          0.055493
3873   scientist        0.044333
2526   learning         0.029539
2644   machine          0.022369
3869   science          0.017041
3503   python           0.016804
3685   reporting        0.015142
1651   excel            0.013871
3874   scientists       0.012198
3203   phd              0.010251
2818   ml               0.010073
2826   models           0.009907
458    bachelor         0.008129
4091   spark            0.006223
233    algorithms       0.006188
3686   reports          0.004794
4169   statistics       0.004112
1171   deep             0.004038
1693   experimentation  0.003922
3512   quality          0.003756
3216   physics          0.003279
327    applied          0.003085
3851   scala            0.002976
1562   ensure           0.002878
3391   production       0.002812
3320   predictive       0.002712
4360   tensorflow       0.002711
2427   java             0.002644
1142   dashboards       0.002616
604    build            0.002614